{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://www.youtube.com/watch?v=vrDdnQfav0s&list=PLxqBkZuBynVTn2lkHNAcw6lgm1MD5QiMK&index=21\"><h1 style=\"font-size:250%; font-family:cursive; color:#ff6666;\"><b>Link YouTube Video - LDA for Topic Modeling</b></h1></a>\n",
    "\n",
    "[![IMAGE ALT TEXT](https://imgur.com/yXOAaIY.png)](https://www.youtube.com/watch?v=vrDdnQfav0s&list=PLxqBkZuBynVTn2lkHNAcw6lgm1MD5QiMK&index=21 \"\")\n",
    "\n",
    "---------------------------"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the well known topic modelling techniques are\n",
    "\n",
    "* Latent Semantic Analysis (LSA)\n",
    "* Probabilistic Latent Semantic Analysis (PLSA)\n",
    "* Latent Dirichlet Allocation (LDA)\n",
    "* Correlated Topic Model (CTM)\n",
    "\n",
    "In this video, we will do LDA\n",
    "\n",
    "Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of topic model and is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions.\n",
    "\n",
    "The LDA makes two key assumptions:\n",
    "\n",
    "- Documents are a mixture of topics, and\n",
    "\n",
    "- Topics are a mixture of tokens (or words)\n",
    "\n",
    "So the documents are known as the probability density (or distribution) of topics and the topics are the probability density (or distribution) of words.\n",
    "\n",
    "## How will LDA optimize the distributions?\n",
    "\n",
    "The end goal of LDA is to find the most optimal representation of the Document-Topic matrix and the Topic-Word matrix to find the most optimized Document-Topic distribution and Topic-Word distribution.\n",
    "\n",
    "As LDA assumes that documents are a mixture of topics and topics are a mixture of words so LDA backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Newsgroups data\n",
    "\n",
    "As before, let's consider a specific set of categories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pyLDAvis -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "def load_dataset(sset, categories):\n",
    "    \"\"\"\n",
    "    Function to load 20 newsgroups dataset from sklearn. The dataset is a collection of approximately \n",
    "    20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.\n",
    "    \n",
    "    Parameters:\n",
    "    sset (str): The subset of the dataset to load. This can be 'train', 'test', or 'all' \n",
    "                to load the training set, the test set, or all dataset.\n",
    "    \n",
    "    categories (list of str): The list of categories (newsgroups) to load. If it's an empty list, \n",
    "                              all categories will be loaded.\n",
    "    \n",
    "    Returns:\n",
    "    newsgroups_dset (sklearn.utils.Bunch): The loaded dataset. It's a dict-like object with the following \n",
    "                                           attributes:\n",
    "                                           - data: the text data to learn\n",
    "                                           - target: the classification labels\n",
    "                                           - target_names: the meaning of the labels\n",
    "                                           - DESCR: the full description of the dataset.\n",
    "\n",
    "    \"\"\"\n",
    "    \n",
    "    if categories==[]:\n",
    "        newsgroups_dset = fetch_20newsgroups(subset=sset,\n",
    "                          remove=('headers', 'footers', 'quotes'),\n",
    "                          shuffle=True)\n",
    "    else:\n",
    "        newsgroups_dset = fetch_20newsgroups(subset=sset, categories=categories,\n",
    "                          remove=('headers', 'footers', 'quotes'),\n",
    "                          shuffle=True)\n",
    "    return newsgroups_dset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 20 Newsgroups dataset is a collection of around 18,000 newsgroups posts on 20 “topics”,\n",
    "\n",
    "---------------\n",
    "\n",
    "### Define the list of categories to extract from the data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9850\n"
     ]
    }
   ],
   "source": [
    "categories = [\"comp.windows.x\", \"misc.forsale\", \"rec.autos\", \"rec.motorcycles\", \"rec.sport.baseball\", \"rec.sport.hockey\", \"sci.crypt\", \"sci.med\", \"sci.space\", \"talk.politics.mideast\"]\n",
    "\n",
    "newsgroups_all = load_dataset('all', categories) # To access both training and test sets, use “all” as the first argument\n",
    "\n",
    "\n",
    "print(len(newsgroups_all.data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.stem import SnowballStemmer\n",
    "\n",
    "stemmer = SnowballStemmer('english')\n",
    "\n",
    "def stem(text):\n",
    "    return stemmer.stem(text)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gensim\n",
    "from gensim.utils import simple_preprocess\n",
    "from gensim.parsing.preprocessing import STOPWORDS as stopwords\n",
    "\n",
    "\n",
    "def preprocess(text):\n",
    "    result = []\n",
    "    for token in gensim.utils.simple_preprocess(text, min_len = 4 ):\n",
    "        if token not in stopwords:\n",
    "            result.append(stem(token))\n",
    "            \n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original document: \n",
      "Hi Xperts!\n",
      "\n",
      "How can I move the cursor with the keyboard (i.e. cursor keys), \n",
      "if no mouse is available?\n",
      "\n",
      "Any hints welcome.\n",
      "\n",
      "Thanks.\n"
     ]
    }
   ],
   "source": [
    "doc_sample = newsgroups_all.data[0]\n",
    "\n",
    "print('Original document: ')\n",
    "print(doc_sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "Tokenized document: \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['Hi',\n",
       " 'Xperts',\n",
       " 'How',\n",
       " 'can',\n",
       " 'I',\n",
       " 'move',\n",
       " 'the',\n",
       " 'cursor',\n",
       " 'with',\n",
       " 'the',\n",
       " 'keyboard',\n",
       " 'i',\n",
       " 'e',\n",
       " 'cursor',\n",
       " 'keys',\n",
       " 'if',\n",
       " 'no',\n",
       " 'mouse',\n",
       " 'is',\n",
       " 'available',\n",
       " 'Any',\n",
       " 'hints',\n",
       " 'welcome',\n",
       " 'Thanks']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('\\n\\nTokenized document: ')\n",
    "words = []\n",
    "\n",
    "for token in gensim.utils.tokenize(doc_sample):\n",
    "    words.append(token)\n",
    "    \n",
    "words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['xpert', 'cursor', 'keyboard', 'cursor', 'key', 'mous', 'avail', 'hint', 'welcom', 'thank']\n"
     ]
    }
   ],
   "source": [
    "print(preprocess(doc_sample))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\txpert, cursor, keyboard, cursor, key, mous, avail, hint, welcom, thank\n",
      "1\tobtain, copi, open, look, widget, obtain, need, order, copi, thank\n",
      "2\tright, signal, strong, live, west, philadelphia, perfect, sport, fan, dream\n",
      "3\tcanadian, thing, coach, boston, bruin, colorado, rocki, summari, post, gather\n",
      "4\theck, feel, like, time, includ, cafeteria, work, half, time, headach\n",
      "5\tdamn, right, late, climb, meet, morn, bother, right, foot, asleep\n",
      "6\tolympus, stylus, pocket, camera, smallest, class, includ, time, date, stamp\n",
      "7\tinclud, follow, chmos, clock, generat, driver, processor, chmos, eras, prom\n",
      "8\tchang, intel, discov, xclient, xload, longer, work, bomb, messag, error\n",
      "9\ttermin, like, power, server, run, window, manag, special, client, program\n"
     ]
    }
   ],
   "source": [
    "for i in range(0, 10):\n",
    "    print(str(i) + \"\\t\" + \", \".join(preprocess(newsgroups_all.data[i])[:10] ))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9850"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "processed_docs = []\n",
    "\n",
    "for i in range(0, len(newsgroups_all.data)):\n",
    "    processed_docs.append(preprocess(newsgroups_all.data[i]))\n",
    "    \n",
    "len(processed_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "39350"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dictionary = gensim.corpora.Dictionary(processed_docs)\n",
    "\n",
    "len(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 avail\n",
      "1 cursor\n",
      "2 hint\n",
      "3 key\n",
      "4 keyboard\n",
      "5 mous\n",
      "6 thank\n",
      "7 welcom\n",
      "8 xpert\n",
      "9 copi\n"
     ]
    }
   ],
   "source": [
    "index = 0\n",
    "\n",
    "for key, value in dictionary.iteritems():\n",
    "    print(key, value)\n",
    "    index +=1\n",
    "    if index > 9:\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5868"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000 )\n",
    "\n",
    "len(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9850"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs ]\n",
    "\n",
    "len(bow_corpus)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```\n",
    "[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],\n",
    " [(6, 1),\n",
    "  (9, 2),\n",
    "  (10, 1),\n",
    "  (11, 1),\n",
    "  (12, 1),\n",
    "  (13, 1),\n",
    "  (14, 2),\n",
    "  (15, 1),\n",
    "  (16, 1),\n",
    "  (17, 1)],\n",
    "  ....\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bow_corpus[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Key 0 =\"avail\":    occurrences=1\n",
      "Key 1 =\"cursor\":    occurrences=2\n",
      "Key 2 =\"hint\":    occurrences=1\n",
      "Key 3 =\"key\":    occurrences=1\n",
      "Key 4 =\"keyboard\":    occurrences=1\n",
      "Key 5 =\"mous\":    occurrences=1\n",
      "Key 6 =\"thank\":    occurrences=1\n",
      "Key 7 =\"welcom\":    occurrences=1\n",
      "Key 8 =\"xpert\":    occurrences=1\n"
     ]
    }
   ],
   "source": [
    "bow_doc = bow_corpus[0]\n",
    "\n",
    "for i in range(len(bow_doc)):\n",
    "    print(f\"Key {bow_doc[i][0]} =\\\"{dictionary[bow_doc[i][0]]}\\\":\\\n",
    "    occurrences={bow_doc[i][1]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic: 0 \n",
      "Words: 0.021*\"encrypt\" + 0.018*\"secur\" + 0.018*\"chip\" + 0.016*\"govern\" + 0.013*\"clipper\" + 0.012*\"public\" + 0.010*\"privaci\" + 0.010*\"key\" + 0.010*\"phone\" + 0.009*\"algorithm\"\n",
      "Topic: 1 \n",
      "Words: 0.017*\"appear\" + 0.014*\"copi\" + 0.013*\"cover\" + 0.013*\"star\" + 0.013*\"book\" + 0.011*\"penalti\" + 0.010*\"black\" + 0.009*\"comic\" + 0.008*\"blue\" + 0.008*\"green\"\n",
      "Topic: 2 \n",
      "Words: 0.031*\"window\" + 0.015*\"server\" + 0.012*\"program\" + 0.012*\"file\" + 0.012*\"applic\" + 0.012*\"display\" + 0.011*\"widget\" + 0.010*\"version\" + 0.010*\"motif\" + 0.010*\"support\"\n",
      "Topic: 3 \n",
      "Words: 0.015*\"space\" + 0.007*\"launch\" + 0.007*\"year\" + 0.007*\"medic\" + 0.006*\"patient\" + 0.006*\"orbit\" + 0.006*\"research\" + 0.006*\"diseas\" + 0.005*\"develop\" + 0.005*\"nasa\"\n",
      "Topic: 4 \n",
      "Words: 0.018*\"armenian\" + 0.011*\"peopl\" + 0.008*\"kill\" + 0.008*\"said\" + 0.007*\"turkish\" + 0.006*\"muslim\" + 0.006*\"jew\" + 0.006*\"govern\" + 0.005*\"state\" + 0.005*\"greek\"\n",
      "Topic: 5 \n",
      "Words: 0.024*\"price\" + 0.021*\"sale\" + 0.020*\"offer\" + 0.017*\"drive\" + 0.017*\"sell\" + 0.016*\"includ\" + 0.013*\"ship\" + 0.013*\"interest\" + 0.011*\"ask\" + 0.010*\"condit\"\n",
      "Topic: 6 \n",
      "Words: 0.018*\"mail\" + 0.016*\"list\" + 0.015*\"file\" + 0.015*\"inform\" + 0.013*\"send\" + 0.012*\"post\" + 0.012*\"avail\" + 0.010*\"request\" + 0.010*\"program\" + 0.009*\"includ\"\n",
      "Topic: 7 \n",
      "Words: 0.019*\"like\" + 0.016*\"know\" + 0.011*\"time\" + 0.011*\"look\" + 0.010*\"think\" + 0.008*\"want\" + 0.008*\"thing\" + 0.008*\"good\" + 0.007*\"go\" + 0.007*\"bike\"\n",
      "Topic: 8 \n",
      "Words: 0.033*\"game\" + 0.022*\"team\" + 0.017*\"play\" + 0.015*\"year\" + 0.013*\"player\" + 0.011*\"season\" + 0.008*\"hockey\" + 0.008*\"score\" + 0.007*\"leagu\" + 0.007*\"goal\"\n",
      "Topic: 9 \n",
      "Words: 0.013*\"peopl\" + 0.012*\"think\" + 0.011*\"like\" + 0.009*\"time\" + 0.009*\"right\" + 0.009*\"israel\" + 0.009*\"know\" + 0.006*\"reason\" + 0.006*\"point\" + 0.006*\"thing\"\n"
     ]
    }
   ],
   "source": [
    "#  Initialize id2word to the dictionary where each word stem is mapped to a unique ID\n",
    "id2word = dictionary\n",
    "\n",
    "# Create the corpus with word frequencies\n",
    "corpus = bow_corpus\n",
    "\n",
    "# Build the LDA model\n",
    "lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,\n",
    "                                           id2word=id2word,\n",
    "                                           num_topics=10, \n",
    "                                           random_state=100,\n",
    "                                           update_every=1,\n",
    "                                           chunksize=1000,\n",
    "                                           passes=10,\n",
    "                                           alpha='symmetric',\n",
    "                                           iterations=100,\n",
    "                                           per_word_topics=True)\n",
    "\n",
    "# Output all topics and for each of them print out its index and\n",
    "# the most informative words identified\n",
    "for index, topic in lda_model.print_topics(-1):\n",
    "    print(f\"Topic: {index} \\nWords: {topic}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Meaning of LDA Model Params\n",
    "\n",
    "https://stackoverflow.com/questions/50805556/understanding-parameters-in-gensim-lda-model\n",
    "\n",
    "I wonder if you have seen [this page][1]?\n",
    "\n",
    "As for the other parameters:\n",
    "\n",
    " - `random_state` - this serves as a seed (in case you wanted to repeat exactly the training process)\n",
    "\n",
    " -  `chunksize` - number of documents to consider at once (affects the memory consumption)\n",
    "\n",
    " -  [`update_every`][2] - update the model every `update_every` `chunksize` chunks (essentially, this is for memory consumption optimization)\n",
    "\n",
    " - `passes` - how many times the algorithm is supposed to pass over the whole corpus\n",
    "\n",
    " - `alpha` - to cite the documentation:\n",
    "\n",
    "    > can be set to an explicit array = prior of your choice. It also\n",
    "    > support special values of `‘asymmetric’ and ‘auto’: the former uses a\n",
    "    > fixed normalized asymmetric 1.0/topicno prior, the latter learns an\n",
    "    > asymmetric prior directly from your data.\n",
    "\n",
    " - `per_word_topics` - setting this to `True` allows for extraction of the most likely topics given a word. The training process is set in such a way that every word will be assigned to a topic. Otherwise, words that are not indicative are going to be omitted. `phi_value` is another parameter that steers this process - it is a threshold for a word treated as indicative or not.\n",
    "\n",
    "Optimal training process parameters are described particularly well in [M. Hoffman et al., Online Learning for Latent Dirichlet Allocation][3].\n",
    "\n",
    "For memory optimization of the training process or the model see [this blog post][4].\n",
    "\n",
    "\n",
    "  [1]: https://radimrehurek.com/gensim/models/ldamodel.html\n",
    "  [2]: https://groups.google.com/forum/#!topic/gensim/ojySenxQHi4\n",
    "  [3]: https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation\n",
    "  [4]: https://miningthedetails.com/blog/python/lda/GensimLDA/"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "for i, topic_list in enumerate(lda_model[bow_corpus]):\n",
    "        print(topic_list)\n",
    "        \n",
    "''' \n",
    "([(1, 0.1004754), (2, 0.8267648)], [(0, [2]), (1, [2]), (2, [2]), (3, [2]), (4, [2]), (5, [2]), (6, [2]), (7, [2]), (8, [1])], [(0, [(2, 0.99994975)]), (1, [(2, 1.9998709)]), (2, [(2, 0.999867)]), (3, [(2, 0.99986845)]), (4, [(2, 0.9999427)]), (5, [(2, 0.99766123)]), (6, [(2, 0.9959857)]), (7, [(2, 0.998597)]), (8, [(1, 0.99861616)])])\n",
    "([(2, 0.46435428), (6, 0.47409445)], [(6, [6, 2]), (9, [6, 2]), (10, [6]), (11, [2, 6]), (12, [2, 6]), (13, [2, 6]), (14, [6, 2]), (15, [2, 6]), (16, [2, 6]), (17, [2])], [(6, [(2, 0.4898241), (6, 0.5101444)]), (9, [(2, 0.575622), (6, 1.42426)]), (10, [(6, 0.99996334)]), (11, [(2, 0.6880579), (6, 0.31189892)]), (12, [(2, 0.78450835), (6, 0.21544383)]), (13, [(2, 0.64989084), (6, 0.35005748)]), (14, [(2, 0.39420748), (6, 1.6055375)]), (15, [(2, 0.64683723), (6, 0.35311702)]), (16, [(2, 0.7038416), (6, 0.29598135)]), (17, [(2, 0.99997586)])])\n",
    "...\n",
    "...\n",
    "...\n",
    "\n",
    "And if I print topic_list[0] it will print\n",
    "\n",
    "[(1, 0.1004752), (2, 0.8267664)]\n",
    "[(2, 0.4643499), (6, 0.47409886)]\n",
    "[(7, 0.4230467), (8, 0.4000204), (9, 0.17062153)]\n",
    "\n",
    "And this is the form where each tuple represents\n",
    "\n",
    "(unique_topic_id, topic_contribution )\n",
    "\n",
    "'''\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_topic_param(ldamodel, corpus, texts):\n",
    "    \"\"\"\n",
    "    Function to extract the dominant topic, its percentage contribution, keywords and a snippet of the original text \n",
    "    for each document in the corpus.\n",
    "\n",
    "    Parameters:\n",
    "    ldamodel (gensim.models.ldamodel.LdaModel): The trained LDA model.\n",
    "    corpus (list of list of (int, float)): The corpus used to train the LDA model. Each document is represented \n",
    "                                           as a list of (word id, word frequency) tuples.\n",
    "    texts (list of str): The original text documents.\n",
    "\n",
    "    Returns:\n",
    "    main_topic (dict of int): The dominant topic for each document.\n",
    "    percentage (dict of float): The percentage contribution of the dominant topic in each document.\n",
    "    keywords (dict of str): The keywords for the dominant topic in each document.\n",
    "    text_snippets (dict of str): A snippet of the original text for each document.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Initialize dictionaries to hold the results\n",
    "    main_topic = {}  # to hold the dominant topic for each document\n",
    "    percentage = {}  # to hold the percentage contribution of the dominant topic in each document\n",
    "    keywords = {}  # to hold the keywords for the dominant topic in each document\n",
    "    text_snippets = {}  # to hold a snippet of the original text for each document\n",
    "    \n",
    "    # Iterate over all the documents in the corpus\n",
    "    for i, topic_list in enumerate(ldamodel[corpus]):\n",
    "        # Get the topic distribution for the document\n",
    "        topic = topic_list[0] if ldamodel.per_word_topics else topic_list\n",
    "\n",
    "        # Sort the topics by their contribution to the document\n",
    "        topic = sorted(topic, key = lambda x: (x[1]), reverse = True)\n",
    "        \n",
    "        # Only the dominant topic, its percentage contribution and its keywords are considered\n",
    "        for j, (topic_num, topic_contribution) in enumerate(topic):\n",
    "            if j == 0:  # if this is the dominant topic\n",
    "                # Get the keywords for the topic\n",
    "                wp = ldamodel.show_topic(topic_num)\n",
    "                topic_keywords = \", \".join([word for word, prop in wp[:5]])\n",
    "                \n",
    "                # Store the results in the dictionaries\n",
    "                main_topic[i] = int(topic_num)\n",
    "                percentage[i] = round(topic_contribution, 4)\n",
    "                keywords[i] = topic_keywords\n",
    "                text_snippets[i] = texts[i][:8]  # get a snippet of the original text\n",
    "            else:\n",
    "                break\n",
    "    \n",
    "    # Return the dictionaries\n",
    "    return main_topic, percentage, keywords, text_snippets\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "main_topic, percentage, keywords, text_snippets   = get_topic_param(lda_model, bow_corpus, processed_docs )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " ID  Main Topic  Contribution (%)  Keywords                                Snippet                                                                           \n",
      " 0   2           0.8268            window, server, program, file, applic\n",
      "  ['xpert', 'cursor', 'keyboard', 'cursor', 'key', 'mous', 'avail', 'hint']         \n",
      " 1   6           0.4742            mail, list, file, inform, send\n",
      "         ['obtain', 'copi', 'open', 'look', 'widget', 'obtain', 'need', 'order']           \n",
      " 2   7           0.4231            like, know, time, look, think\n",
      "          ['right', 'signal', 'strong', 'live', 'west', 'philadelphia', 'perfect', 'sport'] \n",
      " 3   8           0.4159            game, team, play, year, player\n",
      "         ['canadian', 'thing', 'coach', 'boston', 'bruin', 'colorado', 'rocki', 'summari'] \n",
      " 4   9           0.9039            peopl, think, like, time, right\n",
      "        ['heck', 'feel', 'like', 'time', 'includ', 'cafeteria', 'work', 'half']           \n",
      " 5   7           0.6291            like, know, time, look, think\n",
      "          ['damn', 'right', 'late', 'climb', 'meet', 'morn', 'bother', 'right']             \n",
      " 6   3           0.3476            space, launch, year, medic, patient\n",
      "    ['olympus', 'stylus', 'pocket', 'camera', 'smallest', 'class', 'includ', 'time']  \n",
      " 7   5           0.3799            price, sale, offer, drive, sell\n",
      "        ['includ', 'follow', 'chmos', 'clock', 'generat', 'driver', 'processor', 'chmos'] \n",
      " 8   2           0.7944            window, server, program, file, applic\n",
      "  ['chang', 'intel', 'discov', 'xclient', 'xload', 'longer', 'work', 'bomb']        \n",
      " 9   2           0.6383            window, server, program, file, applic\n",
      "  ['termin', 'like', 'power', 'server', 'run', 'window', 'manag', 'special']        \n"
     ]
    }
   ],
   "source": [
    "indexes = list(range(10))\n",
    "\n",
    "rows = []\n",
    "\n",
    "rows.append(['ID', 'Main Topic', 'Contribution (%)', 'Keywords', 'Snippet'])\n",
    "\n",
    "for idx in indexes:\n",
    "    \n",
    "    rows.append([ str(idx), f\"{main_topic.get(idx)}\",\n",
    "                 f\"{percentage.get(idx):.4}\",\n",
    "                 f\"{keywords.get(idx)}\\n\",\n",
    "                f\"{text_snippets.get(idx)}\"                 \n",
    "                 ])\n",
    "    \n",
    "columns = zip(*rows)\n",
    "\n",
    "column_width = [max(len(item) for item in col) for col in columns ]\n",
    "\n",
    "for row in rows:\n",
    "    print(''.join(' {:{width}} '.format(row[i], width = column_width[i]) for i in range(0, len(row)) ))\n",
    "    \n",
    "    \n",
    "    "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pyLDAvis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyLDAvis.gensim_models\n",
    "pyLDAvis.enable_notebook()\n",
    "vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary = lda_model.id2word )\n",
    "vis"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Each bubble represents a topic. The larger the bubble, the higher percentage of that text in the corpus is about that topic.\n",
    "\n",
    "* Blue bars represent the overall frequency of each word in the corpus. If no topic is selected, the blue bars of the most frequently used words will be displayed.\n",
    "\n",
    "* Red bars give the estimated number of times a given term was generated by a given topic. \n",
    "\n",
    "* The further the bubbles are away from each other, the more different they are."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.14 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.14"
  },
  "vscode": {
   "interpreter": {
    "hash": "36cf16204b8548560b1c020c4e8fb5b57f0e4c58016f52f2d4be01e192833930"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
